The data and modelling objects created in this notebook can be downloaded directly to save computational time.
Users who wish to complete the exercises can download a small
template R script. Assuming you have already downloaded the
data objects above, this script will load all data objects so that the
steps used to create them are not necessary to tackle the exercises.
This tutorial relates to content covered in Lecture 4, and relies on the following packages for manipulating data, shaping time series, fitting dynamic regression models and plotting:
library(dplyr)
library(mvgam)
library(tidybayes)
library(ggplot2)
library(marginaleffects)
The data we will use for this tutorial is spawner-recruit data
for sockeye salmon (Oncorhynchus nerka) from the Bristol Bay
region in Southwest Alaska. Commercial fishermen in Alaska target this
important species using seines and gillnets for fresh or frozen fillet
sales and canning. Records for the rivers in Bristol Bay span the years
1952-2005, and are maintained within the RAM Legacy Stock
Assessment Database. Spawner estimates originated from Bristol Bay
Area Annual Management Reports from the Alaska Department of Fish and
Game, and are based on in-river spawner counts (including both tower
counts and aerial surveys). Age-specific catch rates were calculated
from records of catch and escapement according to age class. These data
were used to determine the number of recruits in each year.
Below we process the data, keeping observations for a single river and taking \(log(spawners_{t})\) to use as the offset in our models.
# access the data from the ATSA Github site
load(url('https://github.com/atsa-es/atsalibrary/raw/master/data/sockeye.RData'))
sockeye %>%
# Remove any previous grouping structures
dplyr::ungroup() %>%
# select the target river
dplyr::filter(region == "KVICHAK") %>%
# filter the data to only include years when number of spawners
# was estimated, as this will be used as the offset
dplyr::filter(!is.na(spawners)) %>%
# log the spawners variable to use as an appropriate offset
dplyr::mutate(log_spawners = log(spawners)) %>%
# add a 'time' variable
dplyr::mutate(time = dplyr::row_number()) %>%
# add a series variable (as a factor)
dplyr::mutate(series = as.factor('sockeye')) %>%
# keep variables of interest
dplyr::select(time, series, log_spawners, recruits) -> model_data
The data now contain four variables: time, the indicator
of which time step each observation belongs to series, a
factor indexing which time series each observation belongs to
log_spawners, the logged number of recorded sockeye
spawners (in thousands), measured as the number of returning spawners
minus catch at each timepoint, which will be used as an offset in models
recruits, the response variable representing the number of
estimated sockeye recruits at each timepoint (in thousands)
Check the data structure
head(model_data)
## # A tibble: 6 × 4
## time series log_spawners recruits
## <int> <fct> <dbl> <dbl>
## 1 1 sockeye 9.15 39000
## 2 2 sockeye 7.95 4090
## 3 3 sockeye 6.28 289
## 4 4 sockeye 6.52 547
## 5 5 sockeye 9.59 55300
## 6 6 sockeye 8.22 3500
dplyr::glimpse(model_data)
## Rows: 50
## Columns: 4
## $ time <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ series <fct> sockeye, sockeye, sockeye, sockeye, sockeye, sockeye, soc…
## $ log_spawners <dbl> 9.152711, 7.951559, 6.282267, 6.522093, 9.588777, 8.21878…
## $ recruits <dbl> 39000, 4090, 289, 547, 55300, 3500, 5360, 1120, 5750, 450…
Visualize the data as a heatmap to get a sense of where any
NAs are distributed (NAs are shown as red bars
in the below plot)
image(is.na(t(model_data %>%
dplyr::arrange(dplyr::desc(time)))), axes = F,
col = c('grey80', 'darkred'))
axis(3, at = seq(0,1, len = NCOL(model_data)), labels = colnames(model_data))
Plot properties of the time series, which indicate some overdispersion in the recruit count values that we’ll have to address
plot_mvgam_series(data = model_data, y = 'recruits')
mvgamWe will fit a number of models to these data to showcase the various methods that are available to us for evaluating and comparing dynamic GLMs / GAMs. We begin with a model that does not include a dynamic component, as this is always a useful starting point. It also allows us to highlight another useful feature of GLMs / GAMs when modelling count data, as we can include an offset variable that allows us to model rates of interest.
Sometimes it is more relevant to model the rate an event occurs per
some underlying unit. For example, in epidemiological modelling the
interest is often in understanding the number of disease cases per
population. But we never get to observe this rate directly. An
offset allows us to model the underlying rate by algebraically accounting for variation in the number of
possible counts in each observation unit. This works naturally in
models that use a log link function, such as the Poisson and Negative
Binomial models. Including an offset is usually straightforward in most
R modelling software, and indeed this is the case in
mvgam as well. Here we fit a model with no trend and
Negative Binomial observations, including an offset of the
log_spawners variable that we calculated above so we can
estimate the rate of recruitment per spawner. Our first model
addresses overdispersion in the observed data by using a Negative
Binomial observation model, and we only include the implicit intercept
along with an offset of log_spawners in the linear
predictor. As in previous tutorials, only 250 samples per chain are used
to keep the size of the resulting object small (though we’d always want
to use more samples in a real analysis)
model1 <- mvgam(recruits ~ offset(log_spawners),
family = nb(),
trend_model = 'None',
data = model_data,
samples = 250)
The model can be described mathematically as follows: \[\begin{align*} \boldsymbol{recruits}_t & \sim \text{NegBinomial}(\lambda_t, \phi) \\ log(\lambda_t/\boldsymbol{spawners_{t}}) & = \alpha \\ log(\lambda_t) & = \alpha + 1*log(\boldsymbol{spawners_{t}}) \\ \alpha & \sim \text{Normal}(0, 1) \\ \phi^1 & \sim \text{StudentT}(3, 0, 0.1) \end{align*}\]
Note that this is not a classic density-dependent stock-recruitment model such as the Ricker or Beverton Holt models. Rather, this model very simplistically assumes recruits can increase without bound as a function of the number of spawners.
The inclusion of the offset is important if we wish to make
predictions on the scale of the observations. This is because there will
be no regression coefficient assigned to this variable (it’s coefficient
is fixed at 1). If you wish to manually make predictions, you’ll need to
take care to include the offset. This is returned as an attribute of the
linear predictor matrix when we call predict.gam on the
underlying mgcv model:
lpmatrix <- predict(model1$mgcv_model, type = 'lpmatrix')
attr(lpmatrix, 'model.offset')
## [1] 9.152711 7.951559 6.282267 6.522093 9.588777 8.218787 7.855545
## [8] 5.826000 6.863803 10.098232 8.237479 8.077137 7.847763 9.034796
## [15] 9.539644 7.779049 6.917706 5.424950 8.396155 9.480368 7.585789
## [22] 7.200425 8.330864 9.323669 9.769956 7.467371 7.038784 8.180321
## [29] 9.259131 8.883224 7.073270 8.711114 8.311398 9.026418 8.849371
## [36] 8.347590 8.461680 8.301522 9.028818 9.210340 7.279319 7.313220
## [43] 7.740664 8.732305 7.512071 7.003065 6.556778 7.432484 8.612503
## [50] 7.749322
To make predictions on the outcome scale, we might do something like the following:
# extract posterior estimates of the observation linear predictor betas
betas_post <- as.data.frame(model1, variable = 'betas')
# generate the linear predictor matrix for the fitted data
# (note, you an supply newdata here to predict for new scenarios)
lpmatrix <- predict(model1$mgcv_model, type = 'lpmatrix')
# set up a matrix to store linear predictor values
linpreds <- matrix(NA, nrow = NROW(betas_post), ncol = NROW(lpmatrix))
# loop across all posterior estimates
for(i in 1:NROW(betas_post)){
# multiply covariate values by the beta estimates (including intercepts)
linpreds[i, ] <- lpmatrix %*% betas_post[i, ] +
# add the offset, which will be equal to 0 if no offset was
# supplied to the model
as.vector(attr(lpmatrix, 'model.offset'))
}
# plot a histogram to see the distribution of these values
hist(linpreds,
col = 'darkred',
border = 'white',
xlab = expression(Linear~predictor~(log[mu])),
ylab = '',
yaxt = 'n',
main = '',
lwd = 2)
This same type of process underlies many of mvgam’s
prediction / plotting functions. For example, to plot partial effects of
smooth terms, the plot_mvgam_smooth function will set up a
grid of evenly-spaced hypothetical values as newdata to
predict the contribution of the smooth term to the linear predictor. It
does this by essentially ignoring the contributions of all other
predictors (setting their values to 0 in the linear predictor matrix).
The same process also underlies the functions used to plot conditional
effects and posterior contrasts (you can learn a great deal about how
these newdata grids are set up by looking over the
documentation in the marginaleffects book and by calling
?marginaleffects::datagrid
Once these predictions are made for the linear predictor, you could
then compute posterior expectations (in this case by
exponentiating the values with exp(linpreds) to “undo” the
log link). Or you could compute posterior predictions by
incorporating the uncertainty in the observation model:
# extract posterior estimates of the overdispersion parameter phi
phi_post <- as.data.frame(model1, variable = 'obs_params')
# set up a matrix to store posterior predictions
ypreds <- matrix(NA, nrow = NROW(betas_post), ncol = NROW(lpmatrix))
# loop across all posterior estimates
for(i in 1:NROW(betas_post)){
# take Negative Binomial draws, being sure to exponentiate the
# linear predictors and to use the appropriate overdispersion
# estimate for the 'size' argument
ypreds[i, ] <- rnbinom(n = NROW(lpmatrix),
mu = exp(linpreds[i, ]),
size = phi_post[i,])
}
# plot a histogram to see the distribution of these values
hist(ypreds,
xlim = c(0, quantile(as.vector(ypreds), probs = 0.95)),
breaks = 150,
col = 'darkred',
border = 'white',
xlab = 'Posterior predictions',
ylab = '',
yaxt = 'n',
main = '',
lwd = 2)
mvgam attempts to make this process less error-prone and
more straightforward using the predict.mvgam,
hindcast.mvgam and forecast.mvgam functions.
For example, we could replicate the linear prediction calculations using
the following:
linpreds2 <- predict(model1,
type = 'link',
process_error = FALSE)
all.equal(linpreds, linpreds2)
## [1] TRUE
Notice how we explicitly set process_error to FALSE so
that the calculations of the linear predictor will ignore any
contributions of dynamic trends. This isn’t applicable in this example,
but it becomes important to consider when making predictions from
dynamic models. We could also compute expectations using
predict.mvgam:
linpreds3 <- predict(model1, type = 'expected', process_error = FALSE)
all.equal(exp(linpreds), linpreds3)
## [1] TRUE
And finally we can calculate posterior predictions, which consider uncertainty in the Negative Binomial sampling distribution:
ypreds2 <- predict(model1, type = 'response', process_error = FALSE)
all.equal(ypreds, ypreds2)
## [1] "Mean relative difference: 0.8261321"
These values will not be identical due to the random nature
of the sampling distribution. If you are computing predictions /
expectations for new data, the predict.mvgam function
should be your primary choice. But if you merely want to return
predictions / expectations for the training data and/or any testing data
supplied to the model, you will often be better off using the
hindcast.mvgam and forecast.mvgam functions.
For example, here are the hindcasts of the expectations, which may
differ very slightly from those we computed above due to variation in
the floating point arithmetic’s used by Stan and
R:
hc_expectations <- hindcast(model1, type = 'expected')
all.equal(exp(linpreds), hc_expectations$hindcasts$sockeye)
## [1] "Mean relative difference: 1.369103e-05"
As we’ve seen before, these can be plotted in the same general way as
other mvgam plotting functions:
plot(hc_expectations)
The model summary is shown below, which indicates there are no major sampling issues to worry about
summary(model1)
## GAM formula:
## recruits ~ offset(log_spawners)
##
## Family:
## negative binomial
##
## Link function:
## log
##
## Trend model:
## None
##
## N series:
## 1
##
## N timepoints:
## 50
##
## Status:
## Fitted using Stan
##
## Observation dispersion parameter estimates:
## 2.5% 50% 97.5% Rhat n.eff
## phi[1] 1.1 1.7 2.3 1 511
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n.eff
## (Intercept) 0.69 0.9 1.1 1 767
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 1000 iterations ended with a divergence (0%)
## 0 of 1000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
Note though that the hindcast.mvgam and
forecast.mvgam functions simply extract any predictions
that were made in the original model, so will always include uncertainty
in any underlying dynamic components.
Plot a posterior predictive check that is designed specifically for count-valued observations, namely the rootogram. These graphical checks are particularly useful for diagnosing issues such as a model’s failure to capture overdispersion and/or excess zeros in observed count data. In the plot below, we bin the unique observed values into discrete bins to prevent overplotting and help with interpretation. This plot compares the frequencies of observed vs predicted values for each bin, which can help to identify aspects of poor model fit. For example, if the gray bars (representing observed frequencies) tend to stretch below zero, this suggests the model’s simulations predict the values in that particular bin less frequently than they are observed in the data. A well-fitting model that can generate realistic simulated data will provide a rootogram in which the lower boundaries of the grey bars are generally near zero
ppc(model1, type = 'rootogram')
The model generally tends to underpredict. We can confirm this behaviour by inspecting a PIT histogram to again compare observed vs predicted values. For a well calibrated model fit, the PIT will have a standard uniform distribution
ppc(model1, type = 'pit')
The pattern suggests the model is failing to capture some aspects of the dispersion in the data. In general the variation in counts of recruits is not as smooth as the model is predicting, which we can see more clearly by plotting a posterior predictive check of the cumulative distribution function…
ppc(model1, type = 'cdf')
…or even by plotting a posterior predictive histogram:
ppc(model1, type = 'cdf', n_bins = 40)
There are many other types of posterior predictive checks that can be
implemented, and it is highly recommended that you design some to suit
your own specific needs (i.e., what aspect(s) of the data are you most
concerned with your model being able to adequately capture?). See
?mvgam::ppc
Plot Dunn-Smyth residuals and posterior hindcasts to further inspect model fit
plot(model1, type = 'residuals')
plot(model1, type = 'forecast')
Here it is worth quickly revisiting the uncertainty in these
predicted values, and how that differs from uncertainty in our expected
values (calculated above as hc_expectations). Remember that
our goal is to produce a forecast distribution because we do
not know what the future holds and we can therefore consider it to be a
random variable drawn from some probability distribution. Often
we visualise forecasts as lines and ribbons to show particular summaries
of this forecast distribution, but it can be easier to understand where
these summaries come from if we instead view realizations from
this distribution (i.e. random draws). For example, here are 5
realizations from our expected hindcast values…
plot(hc_expectations, realisations = TRUE, n_realisations = 5)
…and here are 20 realizations:
plot(hc_expectations, realisations = TRUE, n_realisations = 20)
Note how there isn’t much uncertainty in these values because all uncertainty is coming from variation in the intercept term \(\alpha\). In sharp contrast, there is larger uncertainty in our hindcast predictions because these predictions also consider variation in the Negative Binomial sampling distribution (including uncertainty in the overdispersion parameter \(\phi\)). As above, here are 5 realizations for the hindcast predictions…
hc_predictions <- hindcast(model1, type = 'response')
plot(hc_predictions, realisations = TRUE, n_realisations = 5)
…and here are 20 realizations:
plot(hc_predictions, realisations = TRUE, n_realisations = 20)
predict vs hindcastBack to the modelling stragy. Although the model is trying to capture the dispersion in the data (the summary above shows that the \(\phi\) parameter is estimated to be very small, indicating a large amount of overdispersion after accounting for the offset), there are still some patterns in the residuals, including substantial autocorrelation at a lag of 1. We need a dynamic model to capture this autocorrelation, so next we add to the model by including an AR1 latent residuals model
model2 <- mvgam(recruits ~ offset(log_spawners),
family = nb(),
trend_model = 'AR1',
data = model_data)
Because this model has a dynamic trend component, it is worth quickly
revisiting the differences between the predict.mvgam
function and the hindcast.mvgam /
forecast.mvgam functions. First, we investigate how
ignoring the uncertainty in the trend component changes the predictions.
In this step, we calculate predictions for the observed time period in
two ways: first we average over the uncertainty from the latent trend,
and second we ignore this uncertainty. Plotting both of these together
shows how our prediction uncertainty changes when we include the
uncertainty in the dynamic trend component. A crucial detail to
understand here is that the predict.mvgam function does not
account for the actual ‘times’ of the data, rather it assumes the
dynamic process has reached stationarity and simply samples from the
following distribution for the trend: \(z_t
\sim Normal(0, \sigma_{error})\)
# calculate predictions for the training data by including the dynamic process
# uncertainty
ypreds <- predict(model2, type = 'response',
process_error = TRUE, n_cores = 2)
# calculate 95% empirical quantiles of predictions
cred <- sapply(1:NCOL(ypreds),
function(n) quantile(ypreds[,n],
probs = c(0.025, 0.975), na.rm = T))
pred_vals <- 1:NCOL(ypreds)
# graphical plotting parameters
layout(matrix(1:2, ncol = 2))
par(mar = c(2.5, 2.3, 2, 2),
oma = c(1, 1, 0, 0),
mgp = c(1.5, 0.5, 0))
# plot credible intervals
plot(1, type = "n", bty = 'L',
xlab = 'Time',
ylab = 'predict.mvgam predictions',
xlim = c(min(pred_vals), max(pred_vals)),
ylim = c(min(cred, na.rm = T),
max(cred, na.rm = T)))
polygon(c(pred_vals, rev(pred_vals)), c(cred[1,], rev(cred[2,])),
col = 'darkred', border = NA)
# now repeat, but this time ignore uncertainty in the dynamic component
ypreds2 <- predict(model2, type = 'response',
process_error = FALSE, n_cores = 2)
cred2 <- sapply(1:NCOL(ypreds),
function(n) quantile(ypreds2[,n],
probs = c(0.025, 0.975), na.rm = T))
# overlay these predictions as a lighter colour
polygon(c(pred_vals, rev(pred_vals)), c(cred2[1,], rev(cred2[2,])),
col = "#DCBCBC", border = NA)
# overlay the actual observations as well
points(model_data$recruits, pch = 16, cex = 0.8, col = 'white')
points(model_data$recruits, pch = 16, cex = 0.65)
box(bty = 'l', lwd = 2)
# now plot the actual hindcasts, which did not assume the process was
# stationary but instead modelled its dynamics over the training time.
# use the same y-axis limits to see how these plots differ
plot_mvgam_fc(model2, ylim = c(min(cred, na.rm = T),
max(cred, na.rm = T)),
ylab = 'Hindcast predictions')
layout(1)
As you can see, the predictions differ. The
hindcast.mvgam and forecast.mvgam functions
use the time component of the data, so it is crucial that any
newdata fed to the forecast.mvgam function
follows on sequentially from the data that was used to fit the model
(this is not internally checked by the package because it might be a
headache to do so when data are not supplied in a specific time-order).
But the predictions from these functions might not be as useful for
scenario-planning because any scenarios would then be required to have
an explicit time dimension. Sometimes we’d rather get a sense of how
predictions might change on average if some external conditions
changed, for example if temperature were to increase by x degrees or if
we were to sample from a new group for a random effect. For exploring
these types of scenarios, the predict.mvgam function is
more useful. This is the function that is used in any calls to
plot_predictions, for example.
As an illustration of the some of the many uses for the
predict.mvgam function, here we can compute how many
more recruits this model would expect to see if there were
5,000 spawners recorded vs if there were only 3,000 spawners recorded.
We can take full advantage of the avg_comparisons function
from the marginaleffects package to do this (see
?marginaleffects::avg_comparisons for details)
# take draws from the model that compute the average comparison between
# spawners recorded at 5,000 vs 3,000 (be sure to log first!)
post_contrasts <- avg_comparisons(model2,
variables = list(log_spawners = log(c(3000, 5000)))) %>%
posteriordraws()
# use the resulting posterior draw object to plot a density of the
# posterior contrasts
post_contrasts %>%
ggplot(aes(x = draw)) +
# use the stat_halfeye function from tidybayes for a nice visual
stat_halfeye(fill = "#C79999") +
labs(x = "(spawners = 5,000) − (spawners = 3,000)", y = "Density",
title = "Average posterior contrast") +
theme_classic()
model1’s posterior intercept
coefficients \(\alpha\). Use
?mvgam::as.data.frame.mvgam for guidancerecruits vs
log_spawners using the supplies model_data.
Take a few notes on whether you think our primary modelling assumption,
that the number of log(recruits) scales linearly with
number of log_spawners, is justified.# Replace the ? with the correct value(s)
# check names of variables that can be extracted
vars_available <- variables(model1)
# we are interested in the observation model coefficients
vars_available$observation_betas
# extract the intercepts using the as.data.frame.mvgam function
# you will have two options for extracting this variable. first, you
# can use the alias
alphas <- as.data.frame(model1, variable = ?)
# alternatively, use the original name
alphas <- as.data.frame(model1, variable = ?)
# plot a histogram
hist(alphas[,1], xlab = expression(alpha),
main = '', col = 'darkred', border = 'white')
Now on to the topic of model comparisons. Plot Dunn-Smyth residuals for this dynamic model to see that there is no more autocorrelation
plot(model2, type = 'residuals')
However we must be cautious interpreting these residuals because the AR1 trend model is extremely flexible. The in-sample fits can sometimes be nearly perfect, meaning the residuals will very often look excellent! We can confirm this behaviour by inspecting hindcasts
plot(model2, type = 'forecast')
PPCs can still give some valid insights into model inadequacy, though we still have to be a bit cautious due the flexibility of the AR1 process
ppc(model2, type = 'rootogram')
ppc(model2, type = 'pit')
These plots show some minor improvement in the PIT (a slightly more uniform plot). But there are some more worrying and pressing matters we can address with this model. Inspect the summary to see what I mean:
summary(model2)
## GAM formula:
## recruits ~ offset(log_spawners)
##
## Family:
## negative binomial
##
## Link function:
## log
##
## Trend model:
## AR1
##
## N series:
## 1
##
## N timepoints:
## 50
##
## Status:
## Fitted using Stan
##
## Observation dispersion parameter estimates:
## 2.5% 50% 97.5% Rhat n.eff
## phi[1] 3.9 17 240 1.48 14
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n.eff
## (Intercept) 0.064 0.57 1.3 1.07 67
##
## Latent trend parameter AR estimates:
## 2.5% 50% 97.5% Rhat n.eff
## ar1[1] 0.31 0.64 0.92 1.03 131
## sigma[1] 0.48 0.73 0.92 1.06 82
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhats above 1.05 found for 13 parameters
## *Diagnose further to investigate why the chains have not mixed
## 188 of 2000 iterations ended with a divergence (9.4%)
## *Try running with larger adapt_delta to remove the divergences
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## Chain 1: E-FMI = 0.1137
## *E-FMI below 0.2 indicates you may need to reparameterize your model
The HMC sampler is telling us there are clear issues when trying to implement both a flexible latent trend AND an overdispersion parameter. Both of these components are competing to capture the extra dispersion in the observations, leading to low effective sample sizes and poor convergence for the overdispersion parameter \(\phi\). It turns out to be very challenging to estimate them both without making some strong adjustments to the model. We now have several options:
Here we will illustrate the latter two options. First, we try the Negative Binomial observation model with a smooth Gaussian Process for the trend
model3 <- mvgam(recruits ~ offset(log_spawners),
family = nb(),
trend_model = 'GP',
data = model_data)
Next we use a Poisson observation model with a AR1 trend
model4 <- mvgam(recruits ~ offset(log_spawners),
family = poisson(),
trend_model = 'AR1',
data = model_data)
Neither of these models has any issues with sampling, which is a good sign!
summary(model3)
## GAM formula:
## recruits ~ offset(log_spawners)
##
## Family:
## negative binomial
##
## Link function:
## log
##
## Trend model:
## GP
##
## N series:
## 1
##
## N timepoints:
## 50
##
## Status:
## Fitted using Stan
##
## Observation dispersion parameter estimates:
## 2.5% 50% 97.5% Rhat n.eff
## phi[1] 1.4 2.1 3.3 1.01 715
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n.eff
## (Intercept) 0.12 0.74 1.2 1.01 499
##
## Latent trend marginal deviation (alpha) and length scale (rho) estimates:
## 2.5% 50% 97.5% Rhat n.eff
## alpha_gp[1] 0.038 0.54 1.1 1.02 273
## rho_gp[1] 1.100 2.80 17.0 1.01 466
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 11 of 2000 iterations ended with a divergence (0.55%)
## *Try running with larger adapt_delta to remove the divergences
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
summary(model4)
## GAM formula:
## recruits ~ offset(log_spawners)
##
## Family:
## poisson
##
## Link function:
## log
##
## Trend model:
## AR1
##
## N series:
## 1
##
## N timepoints:
## 50
##
## Status:
## Fitted using Stan
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n.eff
## (Intercept) -0.22 0.53 1.7 1.02 147
##
## Latent trend parameter AR estimates:
## 2.5% 50% 97.5% Rhat n.eff
## ar1[1] 0.27 0.58 0.90 1 364
## sigma[1] 0.64 0.78 0.99 1 811
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 2000 iterations ended with a divergence (0%)
## 54 of 2000 iterations saturated the maximum tree depth of 12 (2.7%)
## *Run again with max_treedepth set to a larger value to avoid saturation
## E-FMI indicated no pathological behavior
Hindcasts and trend estimates are very different from the two models
plot(model3, type = 'trend')
plot(model4, type = 'trend')
plot(model3, type = 'forecast')
plot(model4, type = 'forecast')
Here the extreme flexibility of the AR1 trend is apparent in the
hindcasts for model4. Hopefully this gives you more of an
idea about why we can’t put too much faith in residual or fitted value
diagnostics for these flexible dynamic models
Ok so we cannot use in-sample fits or residual diagnostics to guide
reasonable comparisons of these two models. It would be more appropriate
to use 1-step ahead residuals for interrogation. But computing these is
extremely computationally demanding. How then can we compare models?
mvgam implements two different approximators of expanding
window forecast evaluation. The first approximator targets each model’s
ability to simulate temporal dynamics using the single model that has
been fit. Briefly, we must choose a set of timepoints within the
training data to forecast from, allowing us to simulate a situation
where the model’s parameters had already been estimated but we have only
observed data up to the evaluation timepoint and would like to generate
forecasts from the latent trends. This is not a true estimate of h-step
ahead forecasts, because the estimates for the state of the latent
residuals at each timepoint has been informed by data from before this
timepoint AND data from after this timepoint. But it does allow us to
get a sense of which model would have given more reasonable forecasts if
we happened to know what the state of the system was at a given
timepoint in the past. Here we simulate scenarios where we forecast
ahead for the next 3 years. We will use an evaluation sequence covering
19 different evenly spaced timepoints, ensuring a more in-depth
evaluation of each competing model at the same set of timepoints
evaluation_seq <- seq.int(6, 42, by = 2)
Run the approximator for the Gaussian Process / Negative Binomial model. Be aware that these runs can take several minutes to complete, depending on the size and complexity of the model
roll_gp <- roll_eval_mvgam(object = model3,
evaluation_seq = evaluation_seq,
n_samples = 1000, n_cores = 4, fc_horizon = 3)
And now for the AR1 / Poisson model
roll_ar <- roll_eval_mvgam(object = model4,
evaluation_seq = evaluation_seq,
n_samples = 1000, n_cores = 4, fc_horizon = 3)
roll_gp$series_evals$sockeye$all_scores$model = 'GP'
roll_ar$series_evals$sockeye$all_scores$model = 'AR1'
all_score_data <- rbind(roll_gp$series_evals$sockeye$all_scores,
roll_ar$series_evals$sockeye$all_scores) %>%
dplyr::mutate(log_score = log(score),
eval_horizon = paste0('Horizon = ', eval_horizon))
ggplot(all_score_data, aes(x = model, y = log_score,
fill = model)) +
geom_boxplot() +
facet_grid(~ eval_horizon) +
coord_flip() +
labs(y = expression(log[DRPS]),
x = 'Model') +
theme_bw() +
theme(legend.position = 'none')
Design a function to calculate the proportion of comparisons in which a model performed better (gave lower DRPS scores)
prop_better = function(x){
length(which(x < 0)) / length(x)
}
Now calculate the proportion in which the Gaussian Process / Negative Binomial model gave better scores for each forecast horizon. First for horizon 1
prop_better(roll_gp$series_evals$sockeye$all_score %>%
dplyr::filter(eval_horizon == 1) %>%
dplyr::pull(score) -
roll_ar$series_evals$sockeye$all_scores %>%
dplyr::filter(eval_horizon == 1) %>%
dplyr::pull(score))
## [1] 0.5263158
Now for horizon 2
prop_better(roll_gp$series_evals$sockeye$all_score %>%
dplyr::filter(eval_horizon == 2) %>%
dplyr::pull(score) -
roll_ar$series_evals$sockeye$all_scores %>%
dplyr::filter(eval_horizon == 2) %>%
dplyr::pull(score))
## [1] 0.8421053
And for horizon 3
prop_better(roll_gp$series_evals$sockeye$all_score %>%
dplyr::filter(eval_horizon == 3) %>%
dplyr::pull(score) -
roll_ar$series_evals$sockeye$all_scores %>%
dplyr::filter(eval_horizon == 3) %>%
dplyr::pull(score))
## [1] 0.7368421
These summaries are informative. The AR1 / Poisson model tends to
give similar approximate forecasts (lower Discrete Rank Probability
Scores) at a horizon of 1, but does worse at increasing horizons. The
compare_mvgams function can be used to automate this
process by rolling along a set of timepoints for each model. But we have
to be careful interpreting these comparisons, as the AR1 model is
extending out from a well-estimated ‘state’ at each evaluation timepoint
so it has an unfair advantage (the GP trend is smoother and so the
starting ‘state’ for each rolling evaluation is not as unrealistically
perfect). Overall we can’t put too much faith in these comparisons,
unfortunately.
And now for the second approximator, which uses more conventional
leave-future-out comparisons. Time series models are often evaluated
using an expanding window training technique, where the model is
initially trained on some subset of data from \(t = 1\) to \(t =
n_{train}\), and then is used to produce forecasts for the next
\(h\) time steps \((t = n_{train} + h)\). In the next
iteration, the size of training data is expanded by a single time point
and the process repeated. This is obviously computationally challenging
for Bayesian time series models, as the number of refits can be very
large. mvgam uses an approximation based on importance
sampling. Briefly, we refit the model using the first \(min_t\) observations to perform a single
exact \(h~ahead\) forecast step. This
forecast is evaluated against the \(min_t +
h\) out of sample observations using the Expected Log Predictive Density (ELPD; also known as the
log score). Next, we approximate each successive round of expanding
window forecasts by moving forward one step at a time \(for~i~in~1:N_{evaluations}\) and
re-weighting draws from the model’s posterior predictive distribution
using Pareto Smoothed Importance Sampling (PSIS). In each iteration
\(i\), PSIS weights are obtained for
all observations that would have been included in the model if we had
re-fit. If these importance ratios are stable, we consider the
approximation adequate and use the re-weighted posterior’s forecast for
evaluating the next holdout set of testing observations \(((min_t + i + 1):(min_t + i + h))\). This
is similar to the process of particle filtering to update forecasts in
light of new data by re-weighting the posterior draws using importance
weights. But at some point the importance ratio variability will become
too large and importance sampling will be unreliable. This is indicated
by the estimated shape parameter k of the generalized Pareto
distribution crossing a certain threshold
pareto_k_threshold. Only then do we refit the model using
all of the observations up to the time of the failure. We then restart
the process and iterate forward until the next refit is triggered. The
process, which is implemented using the lfo_cv function, is
computationally much more efficient, as only a fraction of the
evaluations typically requires refits (the algorithm is described in
detail by Bürkner et al. 2020).
Run the approximator for the first model, setting
min_t = 20 and using the same forecast horizon as
previously
lfo_gp <- lfo_cv(model3, min_t = 20,
pareto_k_threshold = 0.5,
fc_horizon = 3, n_cores = 4)
The S3 plotting function for these lfo_cv
objects will show the Pareto-k values and ELPD values over the
evaluation time points. For the Pareto k plot, a dashed red line
indicates the specified threshold chosen for triggering model refits.
For the ELPD plot, a dashed red line indicates the bottom 10% quantile
of ELPD values. Points below this threshold may represent outliers that
were more difficult to forecast
plot(lfo_gp)
The top plot (the Pareto k plot) indicates that we only needed one additional refit to give good approximate predictions for the Gaussian Process / Negative Binomial model
lfo_ar <- lfo_cv(model4, min_t = 20,
pareto_k_threshold = 0.5,
fc_horizon = 3, n_cores = 4)
plot(lfo_ar)
The Poisson / AR model, in contrast, needed to be refit many times to ensure the stability of importance-weighted predictions. We would generally expect more refits for this model as the underlying trend has less “memory” than the GP trend, so its forecasts could become irrelevant more quickly as a result.
Compare ELPDs for the two models using all evaluation timepoints when observations weren’t missing by plotting the differences in their scores. This plot will give us an indication if one model is consistently giving better h-step ahead forecasts when trained on successively more data
# determine which timepoints were not missing in the evaluation set
nonmiss_evalpoints <- which(!is.na(model_data$recruits[26:NROW(model_data)]))
# plot the GP ELPD values minus the AR ELPD values
# values will be > 0 if the GP model gave better predictions
# (ELPD is better when larger)
gp_minus_ar <- lfo_gp$elpds[nonmiss_evalpoints] - lfo_ar$elpds[nonmiss_evalpoints]
plot(gp_minus_ar,
ylab = 'GP - AR ELPD',
xlab = 'Time point',
pch = 16,
bty = 'l',
cex = 1,
col = 'white',
xaxt = 'n')
axis(side = 1, at = nonmiss_evalpoints,
labels = lfo_gp$eval_timepoints[nonmiss_evalpoints])
abline(h = 0, lty = 'dashed', lwd = 2)
# show all ELPD comparisons in black
points(gp_minus_ar, pch = 16, cex = 0.85)
# highlight the comparisons for when the GP predictions were better
points(ifelse(gp_minus_ar > 0, gp_minus_ar, NA),
pch = 16, cex = 0.85, col = 'darkred')
The Gaussian Process / Negative Binomial model tends to give better probabilistic predictions throughout the leave-future-out exercise. This result differs sharply from the results from the first approximator above, and is probably due to the fact that the AR1 trend was given an unfair advantage in the first approximator. In-sample residuals for the Gaussian Process / Negative Binomial model look fairly good as well, suggesting we should stick with this model going forward:
plot(model3, type = 'residuals')
If we have enough computational power and our model is simple / small
enough, it can often be worthwhile to compute exact leave-future-out
forecasts. The update.mvgam function makes this relatively
simple to do in mvgam, as all we need to do is supply the
new testing and training datasets and the model will be updated. Below I
show how to do this for our chosen model (model3) by
splitting on timepoint 36 to create training and testing data.
data_train = model_data %>%
dplyr::filter(time <= 36)
data_test = model_data %>%
dplyr::filter(time > 36)
Now refit the model using the update.mvgam function.
Note that you can supply an argument to state lfo = TRUE if
you wish to ignore the computation of residuals and lighten the final
model. This can be helpful if all you plan to do is compare forecasts,
as it will make repeated calls to the update function
faster. See ?mvgam::update.mvgam for more details
model3_exact <- update(model3,
data = data_train,
newdata = data_test,
lfo = TRUE)
Once this model is finished, we are ready to compute exact forecast
evaluations. The first step in this process is to create an object of
class mvgam_forecast using the forecast.mvgam
function:
fc <- forecast(model3_exact)
We’ve seen this class before when calling the
hindcast.mvgam or forecast.mvgam functions.
But we haven’t looked in-depth at the other ways we can use objects of
this class. A primary purpose is to readily allow forecast evalutions
for each series in the data, using a variety of possible scoring
functions. See ?mvgam::score.mvgam_forecast to view the
types of scores that are available. For these data, which have very
large counts, a useful scoring metric is the Continuous Rank Probability Score (CRPS). A CRPS
value is similar to what we might get if we calculated a weighted
absolute error using the full forecast distribution.
scores <- score(fc, score = 'crps')
scores$sockeye
## score in_interval interval_width eval_horizon score_type
## 1 6187.7706 0 0.9 1 crps
## 2 3585.5118 1 0.9 2 crps
## 3 6463.3110 1 0.9 3 crps
## 4 6356.0365 1 0.9 4 crps
## 5 900.0815 1 0.9 5 crps
## 6 1662.9478 1 0.9 6 crps
## 7 2738.9993 0 0.9 7 crps
## 8 NA NA 0.9 8 crps
## 9 NA NA 0.9 9 crps
## 10 NA NA 0.9 10 crps
## 11 NA NA 0.9 11 crps
## 12 NA NA 0.9 12 crps
## 13 NA NA 0.9 13 crps
## 14 NA NA 0.9 14 crps
The returned data.frame now shows the CRPS score for
each evaluation in the testing data, along with some other useful
information about the fit of the forecast distribution. In particular,
we are given a logical value (1s and 0s) telling us whether the true
value was within a pre-specified credible interval (i.e. the coverage of
the forecast distribution). The default interval width is 0.9, so we
would hope that the values in the in_interval column take a
1 approximately 90% of the time. This value can be changed if you wish
to compute different coverages, say using a 60% interval:
scores <- score(fc, score = 'crps', interval_width = 0.6)
scores$sockeye
## score in_interval interval_width eval_horizon score_type
## 1 6187.7706 0 0.6 1 crps
## 2 3585.5118 0 0.6 2 crps
## 3 6463.3110 0 0.6 3 crps
## 4 6356.0365 1 0.6 4 crps
## 5 900.0815 1 0.6 5 crps
## 6 1662.9478 0 0.6 6 crps
## 7 2738.9993 0 0.6 7 crps
## 8 NA NA 0.6 8 crps
## 9 NA NA 0.6 9 crps
## 10 NA NA 0.6 10 crps
## 11 NA NA 0.6 11 crps
## 12 NA NA 0.6 12 crps
## 13 NA NA 0.6 13 crps
## 14 NA NA 0.6 14 crps
Although we would prefer to use distributional forecast evaluations,
point evaluations are also possible using outputs from the
forecast.mvgam function. For example, we can calculate the
Mean Absolute Percentage Error (MAPE), which is a commonly used metric
to evaluate point forecasts when the scales of the time series vary
quite a lot:
# exctract the true values from the test set
truth <- fc$test_observations$sockeye
# calculate mean prediction values for each column in the forecast matrix
mean_fc <- apply(fc$forecasts$sockeye, 2, mean)
# calculate the absolute percentage errors, which are scaled by the
# value of the truths. Note that we only use the first 7 observations
# in the test set because the rest are NA
p <- abs(100*((truth[1:7] - mean_fc[1:7]) / mean_fc[1:7]))
# take the mean of the absolute percentage errors
mean(p)
## [1] 72.99859
Finally, we can look at one more type of score. The
lfo_cv function we used above for approximate
leave-future-out CV currently can only use the ELPD to score forecasts.
This is fine as the ELPD is a strictly proper scoring rule that can be
applied to any distributional forecast, but currently there is no
straightforward way to calculate it for an object of clases
mvgam_forecast like our fc object. But we can
calculate it for all data that was fed to the model (in this case the
training and testing data) using the logLik.mvgam function.
For example, we can calculate the predictive density for our refit like
this:
# calculate predictive density values for each posterior draw for the
# training and testing observations
log_scores <- logLik(model3_exact, n_cores = 2)
This function returns the log(probability) of the observation given
the distributional assumptions (i.e. that the truth is distributed
Negative Binomial with a mean centred around the predicted expectation
and overdispersion given by the estimated \(\phi\) parameter). We can illustrate how
this works with a simple example. Assume our model gave a posterior
expectation of 21 for a particular timepoint. The model has also
estimated an overdispersion parameter of 5, and the true observed value
at this timepoint is 39. The log score is then calculated using the
correct density function (dnbinom in this case; see
?dnbinom for details):
dnbinom(x = 39, mu = 21, size = 5, log = TRUE)
## [1] -4.849416
We would get a higher score (i.e. a better score) if our model’s expectation was closer to the truth:
dnbinom(x = 39, mu = 41, size = 5, log = TRUE)
## [1] -3.860454
In contrast, we would get a lower score (i.e. a worse score) if the expectation was further from the truth:
dnbinom(x = 39, mu = 12, size = 5, log = TRUE)
## [1] -7.979571
The logLik.mvgam function does this for us, computing
the log(probability) scores using the correct distribution for the
model, and returning a matrix of scores (one for each posterior draw *
observation combination). The ELPD for each out of sample observation
can is calculated by taking the mean of these scores. This option is
available in mvgam as well, but note that you must use
type = 'link' when producing forecasts so that expectations
can be calculated. Because of this reason, interval coverages are not
calculated when using type = 'elpd':
fc_linpreds <- forecast(model3_exact, type = 'link')
scores <- score(fc_linpreds, score = 'elpd')
scores$sockeye
## score eval_horizon score_type
## 1 -10.320829 1 elpd
## 2 -9.593766 2 elpd
## 3 -10.257272 3 elpd
## 4 -10.409158 4 elpd
## 5 -8.482106 5 elpd
## 6 -8.879095 6 elpd
## 7 -9.352356 7 elpd
## 8 NA 8 elpd
## 9 NA 9 elpd
## 10 NA 10 elpd
## 11 NA 11 elpd
## 12 NA 12 elpd
## 13 NA 13 elpd
## 14 NA 14 elpd
The pattern in these scores reflects the pattern from the
crps scores above (remember that for ELPD, the goal is to
achieve higher scores).
fc object using the formula provided
by Hyndman and Athanasopoulosmodel3 (i.e. split the data on timepoint 37 rather than
36). Compare CRPS values from this model to the previous refit to see
whether the scores for timepoints 38-43 have changed when conditioning
on the extra information in timepoint 37sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.6 tidybayes_3.0.2 mvgam_1.0.5
## [4] insight_0.19.3 marginaleffects_0.13.0 brms_2.19.0
## [7] Rcpp_1.0.10 mgcv_1.8-41 nlme_3.1-157
## [10] dplyr_1.0.10 downloadthis_0.3.2
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-3 ellipsis_0.3.2 ggridges_0.5.3
## [4] markdown_1.1 base64enc_0.1-3 fs_1.6.1
## [7] rstudioapi_0.14 farver_2.1.1 rstan_2.26.13
## [10] svUnit_1.0.6 DT_0.26 fansi_1.0.3
## [13] mvtnorm_1.1-3 lubridate_1.8.0 bridgesampling_1.1-2
## [16] codetools_0.2-18 splines_4.2.1 cachem_1.0.7
## [19] knitr_1.42 shinythemes_1.2.0 bayesplot_1.9.0
## [22] jsonlite_1.8.4 bsplus_0.1.4 ggdist_3.2.0
## [25] shiny_1.7.2 compiler_4.2.1 backports_1.4.1
## [28] assertthat_0.2.1 Matrix_1.5-1 fastmap_1.1.0
## [31] cli_3.4.0 later_1.3.0 htmltools_0.5.5
## [34] prettyunits_1.1.1 tools_4.2.1 igraph_1.3.4
## [37] coda_0.19-4 gtable_0.3.1 glue_1.6.2
## [40] reshape2_1.4.4 posterior_1.3.1 V8_4.2.1
## [43] jquerylib_0.1.4 vctrs_0.6.2 crosstalk_1.2.0
## [46] tensorA_0.36.2 xfun_0.40 stringr_1.5.0
## [49] ps_1.7.1 collapse_1.9.6 mime_0.12
## [52] miniUI_0.1.1.1 lifecycle_1.0.3 gtools_3.9.3
## [55] klippy_0.0.0.9500 MASS_7.3-57 zoo_1.8-11
## [58] scales_1.2.1 colourpicker_1.2.0 promises_1.2.0.1
## [61] Brobdingnag_1.2-9 parallel_4.2.1 inline_0.3.19
## [64] shinystan_2.6.0 yaml_2.3.6 curl_4.3.2
## [67] gridExtra_2.3 loo_2.5.1 StanHeaders_2.26.13
## [70] sass_0.4.5 stringi_1.7.12 highr_0.10
## [73] dygraphs_1.1.1.6 checkmate_2.1.0 pkgbuild_1.3.1
## [76] zip_2.2.2 rlang_1.1.0 pkgconfig_2.0.3
## [79] matrixStats_0.62.0 distributional_0.3.1 evaluate_0.20
## [82] lattice_0.20-45 purrr_0.3.4 labeling_0.4.2
## [85] rstantools_2.2.0 htmlwidgets_1.5.4 tidyselect_1.2.0
## [88] processx_3.7.0 plyr_1.8.7 magrittr_2.0.3
## [91] R6_2.5.1 generics_0.1.3 DBI_1.1.3
## [94] withr_2.5.0 pillar_1.8.1 xts_0.12.2
## [97] abind_1.4-5 tibble_3.1.8 crayon_1.5.2
## [100] arrayhelpers_1.1-0 utf8_1.2.2 rmarkdown_2.21
## [103] grid_4.2.1 data.table_1.14.4 callr_3.7.3
## [106] threejs_0.3.3 digest_0.6.29 xtable_1.8-4
## [109] tidyr_1.2.1 httpuv_1.6.9 scoringRules_1.0.2
## [112] RcppParallel_5.1.5 stats4_4.2.1 munsell_0.5.0
## [115] bslib_0.4.2 shinyjs_2.1.0